Marks: 60
The stock market has consistently proven to be a good place to invest and save for the future. There are many compelling reasons to invest in stocks: they can help fight inflation, build wealth, and provide certain tax benefits. Steady returns compounded over a long period can grow far more than seems possible, and thanks to the power of compounding, the earlier one starts investing, the larger the corpus one can accumulate for retirement. Overall, investing in stocks can help meet life's financial aspirations.
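The effect of compounding can be sketched in a couple of lines of Python; the principal, rate, and horizon below are illustrative assumptions, not figures from the data:

```python
def future_value(principal, annual_rate, years):
    """Future value of a lump sum under annual compounding."""
    return principal * (1 + annual_rate) ** years

# Illustrative figures only: the same investment, started 20 years earlier,
# grows to several times the later start's final corpus.
early = future_value(10_000, 0.08, 30)
late = future_value(10_000, 0.08, 10)
print(round(early, 2), round(late, 2))
```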
Maintaining a diversified portfolio is important when investing in stocks in order to maximize earnings under any market condition. A diversified portfolio tends to yield higher returns and face lower risk by tempering potential losses when the market is down. It is easy to get lost in the sea of financial metrics used to judge a stock's worth, and repeating that analysis for a multitude of stocks to identify the right picks can be a tedious task. Cluster analysis can identify stocks that exhibit similar characteristics as well as stocks with minimal correlation to one another. This helps investors analyze stocks across different market segments and guard against risks that could leave a portfolio vulnerable to losses.
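The idea of pairing weakly correlated holdings can be illustrated with toy return series (synthetic numbers, not the project data):

```python
import numpy as np

# Two assets that move together, and one that moves independently.
asset_a = np.array([0.02, -0.01, 0.03, -0.02, 0.01])
asset_b = asset_a * 0.9  # tracks asset_a closely: little diversification benefit
rng = np.random.default_rng(42)
asset_c = rng.normal(0, 0.02, size=5)  # unrelated returns: potential diversifier

corr_ab = np.corrcoef(asset_a, asset_b)[0, 1]
corr_ac = np.corrcoef(asset_a, asset_c)[0, 1]
print(corr_ab, corr_ac)
```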
Trade&Ahead is a financial consultancy firm that provides its customers with personalized investment strategies. They have hired you as a Data Scientist and provided you with data comprising stock prices and some financial indicators for a few companies listed on the New York Stock Exchange. Your tasks are to analyze the data, group the stocks based on the attributes provided, and share insights about the characteristics of each group.
#for data manipulation
import numpy as np
import pandas as pd
#for data visualization
import seaborn as sns
import matplotlib.pyplot as plt
#for statistics
import scipy.stats as stats
#for scaling data with z-score
from sklearn.preprocessing import StandardScaler
#for calculating distances
from scipy.spatial.distance import cdist
#for k-means clustering
from sklearn.cluster import KMeans
#for calculating silhouette scores
from sklearn.metrics import silhouette_score
#for graphing and displaying silhouette score and elbow curve
#installing yellowbrick
!pip install yellowbrick
from yellowbrick.cluster import SilhouetteVisualizer, KElbowVisualizer
#for calculating distances
from scipy.spatial.distance import pdist
#for hierarchical clustering
from sklearn.cluster import AgglomerativeClustering
#for making dendrograms and calculating cophenetic correlation
from scipy.cluster.hierarchy import linkage, dendrogram, cophenet
#for ignoring warnings
import warnings
warnings.filterwarnings('ignore')
#importing the data from csv file into a dataframe called stocks
stocks = pd.read_csv('stock_data.csv')
#first five rows of stocks
stocks.head()
| | Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAL | American Airlines Group | Industrials | Airlines | 42.349998 | 9.999995 | 1.687151 | 135 | 51 | -604000000 | 7610000000 | 11.39 | 6.681299e+08 | 3.718174 | -8.784219 |
| 1 | ABBV | AbbVie | Health Care | Pharmaceuticals | 59.240002 | 8.339433 | 2.197887 | 130 | 77 | 51000000 | 5144000000 | 3.15 | 1.633016e+09 | 18.806350 | -8.750068 |
| 2 | ABT | Abbott Laboratories | Health Care | Health Care Equipment | 44.910000 | 11.301121 | 1.273646 | 21 | 67 | 938000000 | 4423000000 | 2.94 | 1.504422e+09 | 15.275510 | -0.394171 |
| 3 | ADBE | Adobe Systems Inc | Information Technology | Application Software | 93.940002 | 13.977195 | 1.357679 | 9 | 180 | -240840000 | 629551000 | 1.26 | 4.996437e+08 | 74.555557 | 4.199651 |
| 4 | ADI | Analog Devices, Inc. | Information Technology | Semiconductors | 55.320000 | -1.827858 | 1.701169 | 14 | 272 | 315120000 | 696878000 | 0.31 | 2.247994e+09 | 178.451613 | 1.059810 |
#last five rows of stocks
stocks.tail()
| | Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 335 | YHOO | Yahoo Inc. | Information Technology | Internet Software & Services | 33.259998 | 14.887727 | 1.845149 | 15 | 459 | -1032187000 | -4359082000 | -4.64 | 939457327.6 | 28.976191 | 6.261775 |
| 336 | YUM | Yum! Brands Inc | Consumer Discretionary | Restaurants | 52.516175 | -8.698917 | 1.478877 | 142 | 27 | 159000000 | 1293000000 | 2.97 | 435353535.4 | 17.682214 | -3.838260 |
| 337 | ZBH | Zimmer Biomet Holdings | Health Care | Health Care Equipment | 102.589996 | 9.347683 | 1.404206 | 1 | 100 | 376000000 | 147000000 | 0.78 | 188461538.5 | 131.525636 | -23.884449 |
| 338 | ZION | Zions Bancorp | Financials | Regional Banks | 27.299999 | -1.158588 | 1.468176 | 4 | 99 | -43623000 | 309471000 | 1.20 | 257892500.0 | 22.749999 | -0.063096 |
| 339 | ZTS | Zoetis | Health Care | Pharmaceuticals | 47.919998 | 16.678836 | 1.610285 | 32 | 65 | 272000000 | 339000000 | 0.68 | 498529411.8 | 70.470585 | 1.723068 |
#viewing number of rows and columns in stocks
stocks.shape
(340, 15)
There are 340 rows and 15 columns in stocks.
#viewing stocks column datatypes and further information
stocks.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   Ticker Symbol                 340 non-null    object
 1   Security                      340 non-null    object
 2   GICS Sector                   340 non-null    object
 3   GICS Sub Industry             340 non-null    object
 4   Current Price                 340 non-null    float64
 5   Price Change                  340 non-null    float64
 6   Volatility                    340 non-null    float64
 7   ROE                           340 non-null    int64
 8   Cash Ratio                    340 non-null    int64
 9   Net Cash Flow                 340 non-null    int64
 10  Net Income                    340 non-null    int64
 11  Earnings Per Share            340 non-null    float64
 12  Estimated Shares Outstanding  340 non-null    float64
 13  P/E Ratio                     340 non-null    float64
 14  P/B Ratio                     340 non-null    float64
dtypes: float64(7), int64(4), object(4)
memory usage: 40.0+ KB
#viewing the statistical summary of all of the stocks columns
stocks.describe(include='all')
| | Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 340 | 340 | 340 | 340 | 340.000000 | 340.000000 | 340.000000 | 340.000000 | 340.000000 | 3.400000e+02 | 3.400000e+02 | 340.000000 | 3.400000e+02 | 340.000000 | 340.000000 |
| unique | 340 | 340 | 11 | 104 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| top | AAL | American Airlines Group | Industrials | Oil & Gas Exploration & Production | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | 1 | 1 | 53 | 16 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | NaN | NaN | NaN | NaN | 80.862345 | 4.078194 | 1.525976 | 39.597059 | 70.023529 | 5.553762e+07 | 1.494385e+09 | 2.776662 | 5.770283e+08 | 32.612563 | -1.718249 |
| std | NaN | NaN | NaN | NaN | 98.055086 | 12.006338 | 0.591798 | 96.547538 | 90.421331 | 1.946365e+09 | 3.940150e+09 | 6.587779 | 8.458496e+08 | 44.348731 | 13.966912 |
| min | NaN | NaN | NaN | NaN | 4.500000 | -47.129693 | 0.733163 | 1.000000 | 0.000000 | -1.120800e+10 | -2.352800e+10 | -61.200000 | 2.767216e+07 | 2.935451 | -76.119077 |
| 25% | NaN | NaN | NaN | NaN | 38.555000 | -0.939484 | 1.134878 | 9.750000 | 18.000000 | -1.939065e+08 | 3.523012e+08 | 1.557500 | 1.588482e+08 | 15.044653 | -4.352056 |
| 50% | NaN | NaN | NaN | NaN | 59.705000 | 4.819505 | 1.385593 | 15.000000 | 47.000000 | 2.098000e+06 | 7.073360e+08 | 2.895000 | 3.096751e+08 | 20.819876 | -1.067170 |
| 75% | NaN | NaN | NaN | NaN | 92.880001 | 10.695493 | 1.695549 | 27.000000 | 99.000000 | 1.698108e+08 | 1.899000e+09 | 4.620000 | 5.731175e+08 | 31.764755 | 3.917066 |
| max | NaN | NaN | NaN | NaN | 1274.949951 | 55.051683 | 4.580042 | 917.000000 | 958.000000 | 2.076400e+10 | 2.444200e+10 | 50.090000 | 6.159292e+09 | 528.039074 | 129.064585 |
#viewing duplicate values in stocks
stocks.duplicated().sum()
0
There are no duplicate values in stocks.
#viewing missing values in stocks
stocks.isnull().sum()
Ticker Symbol                   0
Security                        0
GICS Sector                     0
GICS Sub Industry               0
Current Price                   0
Price Change                    0
Volatility                      0
ROE                             0
Cash Ratio                      0
Net Cash Flow                   0
Net Income                      0
Earnings Per Share              0
Estimated Shares Outstanding    0
P/E Ratio                       0
P/B Ratio                       0
dtype: int64
There are no missing values in stocks.
#creating histogram for every variable in stocks
#color is set to thistle
#excluding Ticker Symbol and Security columns because it is known they are unique for each row
for i in stocks.columns: #for each column in stocks
if (i=='Ticker Symbol' or i=='Security'): # if the column is Ticker Symbol or Security
pass #skip the column
elif (i== 'GICS Sector'): #if the column is GICS Sector
sns.histplot(data=stocks, x=i, color='thistle') #plot the column on x-axis
plt.title(i) #title is column name
plt.grid(False) #do not display grid
plt.xticks(rotation=90) #rotate x-axis labels 90 degrees
plt.show(); #display histogram
elif (i=='GICS Sub Industry'): #if column is GICS Sub Industry
sns.histplot(data=stocks, x=i, color='thistle') #plot column on x-axis
plt.title(i) #title is column name
plt.grid(False) #do not display grid
plt.xticks(fontsize=4, rotation=90) #x-axis labels are rotated 90 degrees/font size is decreased to 4
plt.show();#display histogram
else: #for all other columns
sns.histplot(data=stocks, x=i, color='thistle') #plot column on x-axis
plt.title(i) #title is column name
plt.grid(False) #do not display grid
plt.show(); #display histogram
#adding the numerical columns of stocks to a variable called stocks_num
stocks_num = stocks.select_dtypes(np.number)
#using stocks_num columns to make heatmap
#labels are limited to 2 decimal places/range is from -1 to 1
sns.heatmap(stocks_num.corr(), annot=True, vmin=-1, vmax=1, fmt='.2f')
plt.show(); #displaying heatmap
#using stocks_num columns to make pairplot, including kde on the diagonals
sns.pairplot(data=stocks_num, diag_kind='kde')
plt.show(); #displaying pairplot
1) What does the distribution of stock prices look like?
#creating histogram using data from stocks, plotting Current Price on the x-axis
#setting color to thistle
sns.histplot(data=stocks, x='Current Price', color='thistle')
plt.title('Distribution of Current Stock Prices') #setting title of histogram
plt.xlabel('Current Stock Price') #setting title of x-axis
plt.ylabel('Number of Stocks') #setting title of y-axis
plt.grid(False) #not displaying grid
plt.show(); #displaying histogram
2) The stocks of which economic sector have seen the maximum price increase on average?
#organizing stocks by GICS sector
#finding the mean price change for each GICS sector
#sorting the means in descending order
stocks.groupby('GICS Sector')['Price Change'].mean().sort_values(ascending=False)
GICS Sector
Health Care                    9.585652
Consumer Staples               8.684750
Information Technology         7.217476
Telecommunications Services    6.956980
Real Estate                    6.205548
Consumer Discretionary         5.846093
Materials                      5.589738
Financials                     3.865406
Industrials                    2.833127
Utilities                      0.803657
Energy                        -10.228289
Name: Price Change, dtype: float64
The maximum average price increase is seen in the Health Care sector.
3) How are the different variables correlated with each other?
#viewing the correlation heatmap for numerical columns in stocks
#labels are limited to 2 decimal places/range is from -1 to 1
sns.heatmap(stocks_num.corr(), annot=True, vmin=-1, vmax=1, fmt='.2f')
plt.show(); #displaying heatmap
4) Cash ratio provides a measure of a company's ability to cover its short-term obligations using only cash and cash equivalents. How does the average cash ratio vary across economic sectors?
#organizing stocks by GICS sector
#finding the mean cash ratio for each GICS sector
#sorting the means in descending order
stocks.groupby('GICS Sector')['Cash Ratio'].mean().sort_values(ascending=False)
GICS Sector
Information Technology         149.818182
Telecommunications Services    117.000000
Health Care                    103.775000
Financials                      98.591837
Consumer Staples                70.947368
Energy                          51.133333
Real Estate                     50.111111
Consumer Discretionary          49.575000
Materials                       41.700000
Industrials                     36.188679
Utilities                       13.625000
Name: Cash Ratio, dtype: float64
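For reference, the cash ratio underlying this column is defined as cash and cash equivalents divided by current liabilities. A minimal illustration with made-up figures:

```python
def cash_ratio(cash_and_equivalents, current_liabilities):
    """Cash ratio: ability to cover short-term obligations with cash alone."""
    return cash_and_equivalents / current_liabilities

# Hypothetical company: $50M in cash against $100M of current liabilities.
print(cash_ratio(50_000_000, 100_000_000))  # 0.5
```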
5) P/E ratios can help determine the relative value of a company's shares as they signify the amount of money an investor is willing to invest in a single share of a company per dollar of its earnings. How does the P/E ratio vary, on average, across economic sectors?
#organizing stocks by GICS sector
#finding the mean P/E ratio for each GICS sector
#sorting the means in descending order
stocks.groupby('GICS Sector')['P/E Ratio'].mean().sort_values(ascending=False)
GICS Sector
Energy                         72.897709
Information Technology         43.782546
Real Estate                    43.065585
Health Care                    41.135272
Consumer Discretionary         35.211613
Consumer Staples               25.521195
Materials                      24.585352
Utilities                      18.719412
Industrials                    18.259380
Financials                     16.023151
Telecommunications Services    12.222578
Name: P/E Ratio, dtype: float64
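As a reminder of the underlying formula, P/E is the share price divided by earnings per share; the numbers below are illustrative only:

```python
def pe_ratio(price_per_share, earnings_per_share):
    """Trailing P/E: dollars an investor pays per dollar of annual earnings."""
    return price_per_share / earnings_per_share

# A hypothetical stock trading at $60 with $3 EPS trades at 20x earnings.
print(pe_ratio(60.0, 3.0))  # 20.0
```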
#viewing duplicate values in stocks
stocks.duplicated().sum()
0
#viewing missing values in stocks
stocks.isnull().sum()
Ticker Symbol                   0
Security                        0
GICS Sector                     0
GICS Sub Industry               0
Current Price                   0
Price Change                    0
Volatility                      0
ROE                             0
Cash Ratio                      0
Net Cash Flow                   0
Net Income                      0
Earnings Per Share              0
Estimated Shares Outstanding    0
P/E Ratio                       0
P/B Ratio                       0
dtype: int64
There are no missing or duplicate values in stocks.
#making boxplots for all numerical columns in stocks, using previously created variable stocks_num
for i in stocks_num.columns: #for each column in stocks_num
sns.boxplot(data=stocks_num, x=i) #make boxplot by plotting column on x-axis
plt.title(i) #title of boxplot is column name
plt.grid(False) #not displaying grid
plt.show(); #displaying boxplot
Most of the variables have outliers, but these outliers represent authentic data points. It is best to keep them as they are.
The variables do not need to be changed. The numerical variables have already been isolated into another dataframe called stocks_num. These columns need to be scaled before clustering.
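Scaling matters because k-means relies on Euclidean distance, which an unscaled large-magnitude column (e.g. Net Income, in the billions) would dominate. A small sketch of the standardization applied below, on toy columns rather than the actual data:

```python
import numpy as np

def z_score(column):
    """Standardize a column to mean 0 and (population) standard deviation 1."""
    return (column - column.mean()) / column.std()

# Toy columns on wildly different scales become directly comparable.
net_income = np.array([1e9, 2e9, 3e9])
volatility = np.array([1.1, 1.5, 1.9])
print(z_score(net_income), z_score(volatility))
```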
#assigning the StandardScaler to a variable called scaler
scaler = StandardScaler()
#copying the stocks_num dataframe into a variable called num_columns
num_columns = stocks_num.copy()
#scaling num_columns using the scaler variable, representing StandardScaler()
#saving the scaled values in the scaled_values variable
scaled_values = scaler.fit_transform(num_columns)
#adding the values into a new dataframe scaled_stocks
#scaled_stocks will contain scaled_values and the columns from num_columns
scaled_stocks = pd.DataFrame(scaled_values, columns=num_columns.columns)
#viewing the first five rows of scaled_stocks
scaled_stocks.head()
| | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.393341 | 0.493950 | 0.272749 | 0.989601 | -0.210698 | -0.339355 | 1.554415 | 1.309399 | 0.107863 | -0.652487 | -0.506653 |
| 1 | -0.220837 | 0.355439 | 1.137045 | 0.937737 | 0.077269 | -0.002335 | 0.927628 | 0.056755 | 1.250274 | -0.311769 | -0.504205 |
| 2 | -0.367195 | 0.602479 | -0.427007 | -0.192905 | -0.033488 | 0.454058 | 0.744371 | 0.024831 | 1.098021 | -0.391502 | 0.094941 |
| 3 | 0.133567 | 0.825696 | -0.284802 | -0.317379 | 1.218059 | -0.152497 | -0.219816 | -0.230563 | -0.091622 | 0.947148 | 0.424333 |
| 4 | -0.260874 | -0.492636 | 0.296470 | -0.265515 | 2.237018 | 0.133564 | -0.202703 | -0.374982 | 1.978399 | 3.293307 | 0.199196 |
#creating histogram for every variable in scaled_stocks
#color is set to thistle
for i in scaled_stocks.columns: #for each column in scaled_stocks
sns.histplot(data=scaled_stocks, x=i, color='thistle') #plot column on x-axis
plt.title(i) #title is column name
plt.grid(False) #do not display grid
plt.show(); #display histogram
#using scaled_stocks columns to make heatmap
#labels are limited to 2 decimal places/range is from -1 to 1
sns.heatmap(scaled_stocks.corr(), annot=True, vmin=-1, vmax=1, fmt='.2f')
plt.show(); #displaying heatmap
There are no major changes in the distribution of data in each column or correlation between different variables after data preprocessing and scaling.
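As a sanity check of the standardization, each scaled column should have mean ≈ 0 and standard deviation ≈ 1. A self-contained sketch (synthetic columns stand in for stocks_num here):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
df = pd.DataFrame({'a': rng.normal(50, 10, 100), 'b': rng.exponential(5, 100)})
scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

# StandardScaler uses the population standard deviation (ddof=0).
print(scaled.mean().round(6).tolist(), scaled.std(ddof=0).round(6).tolist())
```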
#setting clusters from 2 to 10 (not including 10)
num_clusters = range(2, 10)
#creating an empty list for mean_distortions
mean_distortions = []
for k in num_clusters: #for each value in the range of num_clusters
model = KMeans(n_clusters=k) #assigning model variable the KMeans function with number of clusters equal to value
model.fit(scaled_stocks) #fit the scaled_stocks data to KMeans function
prediction = model.predict(scaled_stocks) #make prediction for values in scaled_stocks
distortion = (sum(np.min(cdist(scaled_stocks, model.cluster_centers_, 'euclidean'), axis=1))
/scaled_stocks.shape[0]) #find distortion using euclidean distances
mean_distortions.append(distortion) #add distortion value to the mean_distortions list
#print the value and the corresponding mean distortion
print(k, 'Clusters', 'Mean Distortion:', distortion)
2 Clusters Mean Distortion: 2.382318498894466
3 Clusters Mean Distortion: 2.2692367155390745
4 Clusters Mean Distortion: 2.179645269703779
5 Clusters Mean Distortion: 2.1129944992818515
6 Clusters Mean Distortion: 2.0565797933792824
7 Clusters Mean Distortion: 2.0307068651453446
8 Clusters Mean Distortion: 1.9666240276860545
9 Clusters Mean Distortion: 1.9274833859398008
#plotting the number of clusters with their corresponding mean distortions
plt.plot(num_clusters, mean_distortions)
plt.title('Mean Distortion vs. k') #setting title of graph
plt.xlabel('k') #setting title of x-axis
plt.ylabel('Mean Distortion') #setting title of y-axis
plt.show(); #displaying graph
According to the elbow method, the ideal k value seems to be either 6 or 7.
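The bend can also be estimated programmatically. One rough heuristic (an assumption for illustration, not part of the original analysis) picks the k with the largest second difference in the distortion curve:

```python
import numpy as np

def elbow_k(ks, distortions):
    """Pick k where the distortion curve bends most (largest second difference)."""
    second_diff = np.diff(distortions, 2)
    return ks[int(np.argmax(second_diff)) + 1]

# Synthetic distortion curve with a clear bend at k=4.
ks = list(range(2, 8))
distortions = [10.0, 7.0, 4.0, 3.6, 3.3, 3.1]
print(elbow_k(ks, distortions))  # 4
```

On a gently sloping curve like the one above, the heuristic is only a tie-breaker; visual inspection of the plot remains the usual practice.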
#setting clusters from 2 to 10 (not including 10)
clusters_num = range(2,10)
silhouette_scores = [] #setting silhouette_scores to empty list
for k in clusters_num: #for each value in clusters_num
model1 = KMeans(n_clusters=k) #assign model1 the KMeans function for that number of clusters
prediction1 = model1.fit_predict(scaled_stocks) #make prediction using scaled_stocks values
    score = silhouette_score(scaled_stocks, prediction1) #calculate silhouette score using prediction
silhouette_scores.append(score) #add score to silhouette_scores
#print k and the corresponding silhouette score
print(k, 'Clusters', 'Silhouette Score:', score)
2 Clusters Silhouette Score: 0.43969639509980457
3 Clusters Silhouette Score: 0.45755884975007327
4 Clusters Silhouette Score: 0.45483520750820555
5 Clusters Silhouette Score: 0.4033714342513622
6 Clusters Silhouette Score: 0.42287350755988
7 Clusters Silhouette Score: 0.4179608494109058
8 Clusters Silhouette Score: 0.40232990858584977
9 Clusters Silhouette Score: 0.41161415393845907
#creating a plot for number of clusters and corresponding silhouette score
plt.plot(clusters_num, silhouette_scores)
plt.title('Silhouette Scores vs. k') #setting title of graph
plt.xlabel('k') #setting title of x-axis
plt.ylabel('Silhouette Score') #setting title of y-axis
plt.show(); #displaying graph
According to the Silhouette scores plot, the ideal k value seems to be 7.
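For intuition, the silhouette score approaches 1 when clusters are compact and well separated, so values around 0.4 indicate moderate structure. A toy example on synthetic points, not the stock data:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two tight, widely separated groups should score close to 1.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels = [0, 0, 0, 1, 1, 1]
print(round(silhouette_score(X, labels), 3))
```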
#assigning visualizer to the SilhouetteVisualizer function where KMeans has 7 clusters
visualizer = SilhouetteVisualizer(KMeans(7, random_state=1))
visualizer.fit(scaled_stocks) #fitting visualizer to scaled_stocks values
visualizer.show() #displaying visualization
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 7 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
#assigning visualizer to the SilhouetteVisualizer function where KMeans has 6 clusters
visualizer = SilhouetteVisualizer(KMeans(6, random_state=1))
visualizer.fit(scaled_stocks) #fitting visualizer to scaled_stocks values
visualizer.show() #displaying visualization
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 6 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
A k value of 6 and a k value of 7 both have an average silhouette score of about 0.4, and the elbow curve indicates that both could be good choices. However, the plot of silhouette scores shows a steeper change at 7.
It seems best to proceed with 7 as the k value.
#adding the KMeans function with 7 clusters into a variable called kmeans_model
kmeans_model = KMeans(n_clusters=7, random_state=0)
#fitting the scaled_stocks values to the kmeans_model
kmeans_model.fit(scaled_stocks)
KMeans(n_clusters=7, random_state=0)
#adding a new column called kmeans_cluster to the original stocks
stocks['kmeans_cluster'] = kmeans_model.labels_
#saving the data sorted by column average for each cluster into clusters variable
clusters = stocks.groupby('kmeans_cluster').mean(numeric_only=True)
#adding total value column into stocks to include the total number of values in each cluster
clusters['Total Values'] = stocks.groupby('kmeans_cluster')['Current Price'].count()
#reassigning numerical columns of stocks to stocks_num
stocks_num = stocks.select_dtypes(np.number)
#creating a boxplot for each column in stocks_num vs. cluster
for i in stocks_num.columns: #for each column in stocks_num
sns.boxplot(data=stocks_num, x='kmeans_cluster', y=i) #plotting clusters on x-axis and column on y
plt.title(i + ' vs. Clusters') #setting title of boxplot
plt.xlabel('Clusters') #setting title of x-axis
plt.ylabel(i) #setting title of y-axis
plt.show(); #displaying boxplot
#printing the GICS sectors included in each cluster
print('Cluster 1:\n', stocks[stocks['kmeans_cluster'] == 0]['GICS Sector'].unique())
print('Cluster 2:\n', stocks[stocks['kmeans_cluster'] == 1]['GICS Sector'].unique())
print('Cluster 3:\n', stocks[stocks['kmeans_cluster'] == 2]['GICS Sector'].unique())
print('Cluster 4:\n', stocks[stocks['kmeans_cluster'] == 3]['GICS Sector'].unique())
print('Cluster 5:\n', stocks[stocks['kmeans_cluster'] == 4]['GICS Sector'].unique())
print('Cluster 6:\n', stocks[stocks['kmeans_cluster'] == 5]['GICS Sector'].unique())
print('Cluster 7:\n', stocks[stocks['kmeans_cluster'] == 6]['GICS Sector'].unique())
Cluster 1:
 ['Financials' 'Consumer Discretionary' 'Health Care' 'Information Technology' 'Consumer Staples' 'Telecommunications Services' 'Energy']
Cluster 2:
 ['Industrials' 'Health Care' 'Consumer Staples' 'Utilities' 'Financials' 'Real Estate' 'Information Technology' 'Materials' 'Consumer Discretionary' 'Telecommunications Services' 'Energy']
Cluster 3:
 ['Energy']
Cluster 4:
 ['Industrials' 'Consumer Discretionary' 'Consumer Staples' 'Financials']
Cluster 5:
 ['Information Technology' 'Consumer Discretionary' 'Health Care']
Cluster 6:
 ['Information Technology' 'Health Care' 'Real Estate' 'Telecommunications Services' 'Energy' 'Consumer Discretionary' 'Consumer Staples' 'Materials']
Cluster 7:
 ['Energy' 'Industrials' 'Materials' 'Information Technology']
#adding the different distance metric methods into distances
distances=['mahalanobis', 'euclidean', 'cityblock', 'chebyshev']
#adding the different linkage methods into linkage
linkages = ['complete', 'weighted', 'single', 'average']
high_i_l = [0,0]
high_correlation = 0
for i in distances: #for each value in distances
for l in linkages: #for each value in linkages
Z = linkage(scaled_stocks, metric=i, method=l) #use the distance metric and linkage method on scaled_stocks values
c, coph_dists = cophenet(Z, pdist(scaled_stocks)) #find the cophenetic correlation of the values in scaled_stocks
#print the cophenetic correlation, distance metric, and linkage values
print('Cophenetic Correlation:', c, 'Distance:', i, 'Linkage:', l)
if high_correlation < c:
high_correlation = c
high_i_l[0] = i
high_i_l[1] = l
Cophenetic Correlation: 0.7925307202850002 Distance: mahalanobis Linkage: complete
Cophenetic Correlation: 0.8708317490180428 Distance: mahalanobis Linkage: weighted
Cophenetic Correlation: 0.9259195530524591 Distance: mahalanobis Linkage: single
Cophenetic Correlation: 0.9247324030159737 Distance: mahalanobis Linkage: average
Cophenetic Correlation: 0.7873280186580672 Distance: euclidean Linkage: complete
Cophenetic Correlation: 0.8693784298129404 Distance: euclidean Linkage: weighted
Cophenetic Correlation: 0.9232271494002922 Distance: euclidean Linkage: single
Cophenetic Correlation: 0.9422540609560814 Distance: euclidean Linkage: average
Cophenetic Correlation: 0.7375328863205818 Distance: cityblock Linkage: complete
Cophenetic Correlation: 0.731045513520281 Distance: cityblock Linkage: weighted
Cophenetic Correlation: 0.9334186366528574 Distance: cityblock Linkage: single
Cophenetic Correlation: 0.9302145048594667 Distance: cityblock Linkage: average
Cophenetic Correlation: 0.598891419111242 Distance: chebyshev Linkage: complete
Cophenetic Correlation: 0.9127355892367 Distance: chebyshev Linkage: weighted
Cophenetic Correlation: 0.9062538164750717 Distance: chebyshev Linkage: single
Cophenetic Correlation: 0.9338265528030499 Distance: chebyshev Linkage: average
The highest cophenetic correlation is 0.942, obtained with the Euclidean distance metric and average linkage.
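For context, the cophenetic correlation measures how faithfully a dendrogram's merge heights preserve the original pairwise distances. A minimal self-contained sketch on synthetic data (two clean blobs standing in for scaled_stocks):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
# Two well-separated blobs: the hierarchy should preserve distances well.
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(8, 0.5, (10, 2))])
Z = linkage(X, metric='euclidean', method='average')
c, _ = cophenet(Z, pdist(X))
print(round(c, 3))  # close to 1 for clean structure
```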
#adding the different linkage methods into linkage
linkages = ['single', 'weighted', 'ward', 'centroid', 'complete', 'average']
high_i_l = [0,0]
high_correlation = 0
for l in linkages: #for each value in linkages
Z = linkage(scaled_stocks, metric='euclidean', method=l) #use the euclidean distance metric and linkage method on scaled_stocks values
c, coph_dists = cophenet(Z, pdist(scaled_stocks)) #find the cophenetic correlation of the values in scaled_stocks
#print the cophenetic correlation, distance metric, and linkage values
print('Cophenetic Correlation:', c, 'Linkage:', l)
plt.figure(figsize=(10, 5)) #figure size is set to (10,5)
plt.title('Dendrogram for Linkage:'+ l) #setting title of dendrogram
plt.grid(False) #not displaying grid
dendrogram(Z) #making dendrogram
plt.show(); #displaying dendrogram
if high_correlation < c:
high_correlation = c
high_i_l[0] = 'euclidean'
high_i_l[1] = l
Cophenetic Correlation: 0.9232271494002922 Linkage: single
Cophenetic Correlation: 0.8693784298129404 Linkage: weighted
Cophenetic Correlation: 0.7101180299865353 Linkage: ward
Cophenetic Correlation: 0.9314012446828154 Linkage: centroid
Cophenetic Correlation: 0.7873280186580672 Linkage: complete
Cophenetic Correlation: 0.9422540609560814 Linkage: average
#building a hierarchical clustering model using 7 clusters, euclidean distance, and average linkage
hier_model = AgglomerativeClustering(n_clusters=7, affinity='euclidean', linkage='average')
hier_model.fit(scaled_stocks) #fitting the model to scaled_stocks
AgglomerativeClustering(affinity='euclidean', linkage='average', n_clusters=7)
#adding the clusters to a new column called hier_clusters in stocks dataframe
stocks['hier_cluster'] = hier_model.labels_
#reassigning numerical columns of stocks to stocks_num
stocks_num = stocks.select_dtypes(np.number)
#creating a boxplot for each column in stocks_num vs. cluster
for i in stocks_num.columns: #for each column in stocks_num
sns.boxplot(data=stocks_num, x='hier_cluster', y=i) #plotting clusters on x-axis and column on y
plt.title(i + ' vs. Clusters') #setting title of boxplot
plt.xlabel('Clusters') #setting title of x-axis
plt.ylabel(i) #setting title of y-axis
plt.show(); #displaying boxplot
#printing the GICS sectors included in each cluster
print('Cluster 1:\n', stocks[stocks['hier_cluster'] == 0]['GICS Sector'].unique())
print('Cluster 2:\n', stocks[stocks['hier_cluster'] == 1]['GICS Sector'].unique())
print('Cluster 3:\n', stocks[stocks['hier_cluster'] == 2]['GICS Sector'].unique())
print('Cluster 4:\n', stocks[stocks['hier_cluster'] == 3]['GICS Sector'].unique())
print('Cluster 5:\n', stocks[stocks['hier_cluster'] == 4]['GICS Sector'].unique())
print('Cluster 6:\n', stocks[stocks['hier_cluster'] == 5]['GICS Sector'].unique())
print('Cluster 7:\n', stocks[stocks['hier_cluster'] == 6]['GICS Sector'].unique())
Cluster 1:
 ['Energy']
Cluster 2:
 ['Financials' 'Information Technology']
Cluster 3:
 ['Health Care' 'Consumer Discretionary' 'Information Technology']
Cluster 4:
 ['Information Technology']
Cluster 5:
 ['Consumer Discretionary']
Cluster 6:
 ['Information Technology']
Cluster 7:
 ['Industrials' 'Health Care' 'Information Technology' 'Consumer Staples' 'Utilities' 'Financials' 'Real Estate' 'Materials' 'Consumer Discretionary' 'Energy' 'Telecommunications Services']
Similarities in Both Techniques
Differences in Both Techniques
Similarities in Cluster Profiles
- K-means Cluster 3, K-means Cluster 7, and Hierarchical Cluster 1
- K-means Cluster 6 and Hierarchical Cluster 4
- K-means Cluster 1 and Hierarchical Cluster 2
- K-means Cluster 5 and Hierarchical Cluster 5
- K-means Cluster 4 and Hierarchical Cluster 6
- K-means Cluster 2 and Hierarchical Cluster 7
Differences in Cluster Profiles
The hierarchical cluster profile for cluster 3 did not really fit anywhere, because its combination of characteristics was too different from the other clusters: it has a broader range of current prices, a larger and more positive percent price change, moderate volatility, and a high P/E ratio.
According to the cluster profiles provided:
In conclusion, the most profitable or stable stocks to invest in are considered to be: